DOI: 10.1145/3519939.3523701 (PLDI Conference Proceedings)

All you need is superword-level parallelism: systematic control-flow vectorization with SLP

Published: 09 June 2022

ABSTRACT

Superword-level parallelism (SLP) vectorization is a proven technique for vectorizing straight-line code. It works by replacing groups of independent, isomorphic instructions with equivalent vector instructions. Larsen and Amarasinghe originally proposed using SLP vectorization (together with loop unrolling) as a simpler, more flexible alternative to traditional loop vectorization. However, this vision of replacing traditional loop vectorization has not been realized because SLP vectorization cannot directly reason about control flow.
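For illustration (this sketch is ours, not an example from the paper), the first function below contains four independent, isomorphic scalar additions, the classic seed for SLP packing; the second shows the effect of the transformation, written by hand with x86 SSE intrinsics for concreteness:

```c
#include <immintrin.h>

/* Four independent, isomorphic scalar additions: an SLP pack candidate. */
void add4_scalar(const float *a, const float *b, float *c) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}

/* What SLP vectorization effectively produces: a single 4-wide
   vector add replacing the four scalar instructions. */
void add4_vector(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));
}
```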

In this work, we introduce SuperVectorization, a new vectorization framework that generalizes SLP vectorization to uncover parallelism that spans different basic blocks and loop nests. With the capability to systematically vectorize instructions across control-flow regions such as basic blocks and loops, our framework simultaneously subsumes the roles of inner-loop, outer-loop, and straight-line vectorizer while retaining the flexibility of SLP vectorization (e.g., partial vectorization).
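To make the control-flow challenge concrete, consider a loop body with a lane-divergent branch (again a hedged sketch of ours, not code from the paper). A straight-line SLP vectorizer cannot pack the two sides of the branch; vectorizing it requires turning the control dependence into a data dependence (if-conversion) and selecting results per lane:

```c
#include <immintrin.h>

/* Control flow inside the loop body: each element takes its own
   branch, so the branch is divergent across SIMD lanes. */
void relu_scalar(float *x, int n) {
    for (int i = 0; i < n; i++) {
        if (x[i] < 0.0f)
            x[i] = 0.0f;
    }
}

/* If-converted form: the branch becomes a per-lane mask and a select
   (SSE4.1 _mm_blendv_ps). Assumes n is a multiple of 4 for brevity. */
void relu_vector(float *x, int n) {
    const __m128 zero = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 v    = _mm_loadu_ps(x + i);
        __m128 mask = _mm_cmplt_ps(v, zero);        /* lanes where x[i] < 0 */
        __m128 r    = _mm_blendv_ps(v, zero, mask); /* pick 0.0f in those lanes */
        _mm_storeu_ps(x + i, r);
    }
}
```

If-conversion of a single branch is a standard building block; the framework's contribution, per the abstract, is applying such transformations systematically across basic blocks and whole loop nests.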

Our evaluation shows that a single instance of our vectorizer is competitive with and, in many cases, significantly better than LLVM’s vectorization pipeline, which includes both loop and SLP vectorizers. For example, on an unoptimized, sequential volume renderer from Pharr and Mark, our vectorizer gains a 3.28× speedup, whereas none of the production compilers that we tested can vectorize it, owing to its complex control-flow constructs.

References

  1. 2022. Auto-Vectorization in GCC. https://gcc.gnu.org/projects/tree-ssa/vectorization.html
  2. 2022. Auto-Vectorization in LLVM. https://llvm.org/docs/Vectorizers.html
  3. 2022. llvm::TargetTransformInfo Class Reference. https://llvm.org/doxygen/classllvm_1_1TargetTransformInfo.html
  4. Randy Allen and Ken Kennedy. 1987. Automatic Translation of FORTRAN Programs to Vector Form. ACM Transactions on Programming Languages and Systems.
  5. Randy Allen, Ken Kennedy, Carrie Porterfield, and Joe Warren. 1983. Conversion of Control Dependence to Data Dependence. In Symposium on Principles of Programming Languages.
  6. Sara S. Baghsorkhi, Nalini Vasudevan, and Youfeng Wu. 2016. FlexVec: Auto-vectorization for Irregular Loops. In Programming Language Design and Implementation.
  7. Bob Blainey, Christopher Barton, and José Nelson Amaral. 2002. Removing Impediments to Loop Fusion Through Code Transformations. In International Workshop on Languages and Compilers for Parallel Computing.
  8. David Callahan, Jack J. Dongarra, and David Levine. 1988. Vectorizing Compilers: A Test Suite and Results. In ACM/IEEE Conference on Supercomputing.
  9. Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. 1991. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Transactions on Programming Languages and Systems.
  10. Tobias Grosser, Armin Größlinger, and Christian Lengauer. 2012. Polly – Performing Polyhedral Optimizations on a Low-Level Intermediate Representation. Parallel Processing Letters.
  11. Khronos Group. 2009. OpenCL 1.0 Specification. http://khronos.org/registry/cl/specs/opencl-1.0.pdf
  12. Ralf Karrenberg and Sebastian Hack. 2011. Whole Function Vectorization. In International Symposium on Code Generation and Optimization.
  13. Ken Kennedy and Kathryn S. McKinley. 1993. Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution. In International Workshop on Languages and Compilers for Parallel Computing. 301–320.
  14. Samuel Larsen and Saman Amarasinghe. 2000. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Programming Language Design and Implementation.
  15. Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization.
  16. Jun Liu, Yuanrui Zhang, Ohyoung Jang, Wei Ding, and Mahmut Kandemir. 2012. A Compiler Framework for Extracting Superword Level Parallelism. In Programming Language Design and Implementation.
  17. Charith Mendis and Saman Amarasinghe. 2018. goSLP: Globally Optimized Superword Level Parallelism Framework. Proceedings of the ACM on Programming Languages.
  18. Simon Moll and Sebastian Hack. 2018. Partial Control-Flow Linearization. In Programming Language Design and Implementation.
  19. Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006. Auto-vectorization of Interleaved Data for SIMD. In Programming Language Design and Implementation.
  20. Dorit Nuzman and Ayal Zaks. 2008. Outer-loop Vectorization: Revisited for Short SIMD Architectures. In International Conference on Parallel Architectures and Compilation Techniques.
  21. Karl J. Ottenstein, Robert A. Ballance, and Arthur B. MacCabe. 1990. The Program Dependence Web: A Representation Supporting Control-, Data-, and Demand-Driven Interpretation of Imperative Languages. In Programming Language Design and Implementation.
  22. Joseph C. H. Park and Mike Schlansker. 1991. On Predicated Execution.
  23. Matt Pharr and William R. Mark. 2012. ispc: A SPMD Compiler for High-Performance CPU Programming. In Innovative Parallel Computing.
  24. Vasileios Porpodas and Timothy M. Jones. 2015. Throttling Automatic Vectorization: When Less is More. In Conference on Parallel Architecture and Compilation.
  25. Vasileios Porpodas, Alberto Magni, and Timothy M. Jones. 2015. PSLP: Padded SLP Automatic Vectorization. In International Symposium on Code Generation and Optimization.
  26. Vasileios Porpodas, Rodrigo C. O. Rocha, and Luís F. W. Góes. 2018. VW-SLP: Auto-Vectorization with Adaptive Vector Width. In International Conference on Parallel Architectures and Compilation Techniques.
  27. Vasileios Porpodas, Rodrigo C. O. Rocha, Evgueni Brevnov, Luís F. W. Góes, and Timothy Mattson. 2019. Super-Node SLP: Optimized Vectorization for Code Sequences Containing Operators and Their Inverse Elements. In International Symposium on Code Generation and Optimization.
  28. Louis-Noël Pouchet. 2021. PolyBench/C: The Polyhedral Benchmark Suite. https://web.cse.ohio-state.edu/~pouchet.2/software/polybench/
  29. Rodrigo C. O. Rocha, Vasileios Porpodas, Pavlos Petoumenos, Luís F. W. Góes, Zheng Wang, Murray Cole, and Hugh Leather. 2020. Vectorization-Aware Loop Unrolling with Seed Forwarding. In International Conference on Compiler Construction.
  30. Ira Rosen, Dorit Nuzman, and Ayal Zaks. 2007. Loop-aware SLP in GCC. In GCC Developers Summit.
  31. Jaewook Shin, Mary Hall, and Jacqueline Chame. 2005. Superword-Level Parallelism in the Presence of Control Flow. In International Symposium on Code Generation and Optimization.
  32. Jean-Baptiste Tristan, Paul Govereau, and Greg Morrisett. 2011. Evaluating Value-Graph Translation Validation for LLVM. In Programming Language Design and Implementation.
  33. Peng Tu and David Padua. 1995. Efficient Building and Placing of Gating Functions. In Programming Language Design and Implementation.
